library(readr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(caret)
library(plotly)
library(randomForest)
library(nnet)

Some more EDA - Seagrass species in the Great Barrier Reef, 1984-2018

We will now focus our attention on seagrass, and aim to build a classification algorithm to predict which seagrass species is present at a given site in the Great Barrier Reef. We will use location, sediment type, and seabed type to classify observations into four seagrass species: Cymodocea serrulata, Syringodium isoetifolium, Thalassia hemprichii, and Zostera muelleri (subspecies capricorni).

# read data
setwd("cleaned_data")
seagrass <- read.csv("seagrass_classification_data.csv", as.is = TRUE)

# factorize relevant variables
seagrass$SPECIES <- as.factor(seagrass$SPECIES)
seagrass$SEDIMENT <- as.factor(seagrass$SEDIMENT)
seagrass$TIDAL <- as.factor(seagrass$TIDAL)

# view data
head(seagrass)
##      SPECIES  LATITUDE LONGITUDE DEPTH SEDIMENT      TIDAL
## 1 Z_CAPIRCOR -23.65857  151.1206     0      Mud intertidal
## 2 Z_CAPIRCOR -23.65840  151.1212     0      Mud intertidal
## 3 Z_CAPIRCOR -23.65898  151.1203     0      Mud intertidal
## 4 Z_CAPIRCOR -23.65970  151.1197     0      Mud intertidal
## 5 Z_CAPIRCOR -23.65884  151.1205     0      Mud intertidal
## 6 Z_CAPIRCOR -23.65993  151.1197     0      Mud intertidal
summary(seagrass)
##        SPECIES         LATITUDE        LONGITUDE         DEPTH        
##  C_SERRULAT: 1256   Min.   :-24.20   Min.   :142.5   Min.   : 0.0000  
##  S_ISOETIFO:  101   1st Qu.:-23.79   1st Qu.:146.8   1st Qu.: 0.0000  
##  T_HEMPRICH:  823   Median :-22.41   Median :150.5   Median : 0.0000  
##  Z_CAPIRCOR:10201   Mean   :-21.06   Mean   :149.0   Mean   : 0.9253  
##                     3rd Qu.:-19.17   3rd Qu.:151.3   3rd Qu.: 1.2279  
##                     Max.   :-10.75   Max.   :151.9   Max.   :29.3583  
##                                                                       
##    SEDIMENT           TIDAL     
##  Gravel:   1   intertidal:7433  
##  Mud   :8969   subtidal  :4948  
##  Reef  :  63                    
##  Rock  :  66                    
##  Rubble: 107                    
##  Sand  :3151                    
##  Shell :  24

First we plot the seagrass observations by location (latitude and longitude). Then we add a third axis (water depth) to visualize the data in three dimensions using the plotly package. Since depth is measured in meters below sea level, we plot it as negative values.

seagrass %>%
  ggplot() +
  geom_point(aes(x=LATITUDE, y=LONGITUDE, color=SPECIES)) +
  ggtitle("Australia, Northeast Coast (Great Barrier Reef)")

plot_ly(seagrass, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")

We can also see how our categorical predictors, sediment and seabed type, relate to the observed counts of each seagrass species:

seagrass %>% 
  ggplot(aes(SEDIMENT, fill=SPECIES)) +
  geom_bar(width=.5, position = "dodge")

seagrass %>% 
  ggplot(aes(TIDAL, fill=SPECIES)) +
  geom_bar(width=.5, position = "dodge")

Random Forest

Our first attempt at a classifier (to predict which seagrass species is present) will use a random forest. We first partition the data into training and testing sets. Since we have over 12,000 observations, we can allocate 75% of the data to the training set.

# train-test split, stratified by species
# (set a seed beforehand for a reproducible split)
seagrass_train_ind <- createDataPartition(y = seagrass$SPECIES, p=0.75, list=FALSE)

train_set <- seagrass[seagrass_train_ind, ]
test_set <- seagrass[-seagrass_train_ind, ]
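Since createDataPartition() samples within each level of the outcome, the split should preserve the class proportions. A quick sanity check, assuming the train_set and test_set objects created above:

```r
# class proportions should be nearly identical in the two sets,
# because createDataPartition() samples within each species
round(prop.table(table(train_set$SPECIES)), 3)
round(prop.table(table(test_set$SPECIES)), 3)
```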

# fit RF using relevant features, with mtry~p/3
rf_fit <- randomForest(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT + TIDAL, 
                       data=train_set, mtry = 2)

# build vector of predictions
rf_pred <- predict(rf_fit, newdata = test_set, type = "response")

# view performance metrics
confusionMatrix(table(pred = rf_pred, true = test_set$SPECIES))
## Confusion Matrix and Statistics
## 
##             true
## pred         C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
##   C_SERRULAT        232         11         10         24
##   S_ISOETIFO          3          6          0          0
##   T_HEMPRICH          5          5        193          3
##   Z_CAPIRCOR         74          3          2       2523
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9548          
##                  95% CI : (0.9468, 0.9618)
##     No Information Rate : 0.8242          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8458          
##                                           
##  Mcnemar's Test P-Value : 4.663e-07       
## 
## Statistics by Class:
## 
##                      Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity                    0.73885          0.240000           0.94146
## Specificity                    0.98381          0.999022           0.99550
## Pos Pred Value                 0.83755          0.666667           0.93689
## Neg Pred Value                 0.97089          0.993841           0.99584
## Prevalence                     0.10149          0.008080           0.06626
## Detection Rate                 0.07498          0.001939           0.06238
## Detection Prevalence           0.08953          0.002909           0.06658
## Balanced Accuracy              0.86133          0.619511           0.96848
##                      Class: Z_CAPIRCOR
## Sensitivity                     0.9894
## Specificity                     0.8548
## Pos Pred Value                  0.9696
## Neg Pred Value                  0.9451
## Prevalence                      0.8242
## Detection Rate                  0.8154
## Detection Prevalence            0.8410
## Balanced Accuracy               0.9221

We see that our classification model works quite well, especially for T_HEMPRICH and Z_CAPIRCOR, which both have sensitivity and specificity above 85%. However, the model has quite low sensitivity for S_ISOETIFO (24%). Recall from our data wrangling process that S_ISOETIFO had only about 100 "Yes" observations. With so few examples of this class relative to the other three seagrass species, the model has little to learn from, which likely contributed to the low sensitivity.
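The imbalance can be quantified directly from the class counts; a short sketch using the seagrass data frame loaded above:

```r
# raw counts and proportions per species;
# S_ISOETIFO accounts for well under 1% of observations
table(seagrass$SPECIES)
round(prop.table(table(seagrass$SPECIES)), 4)
```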

We can visually compare our predicted values with the true species values to see how our model performed.

# true values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
# predicted values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~rf_pred, type="scatter3d", mode="markers")

These plots of the true and predicted values look fairly similar for all four species.

Below we see that across sediment types, the species predictions (rf_pred) appear very similar to the true species distribution (SPECIES). The same holds across seabed types, shown in the two bar graphs that follow.

test_set %>% 
  ggplot(aes(SEDIMENT, fill=SPECIES)) +
  geom_bar(width=.5, position = "dodge")

test_set %>% 
  ggplot(aes(SEDIMENT, fill=rf_pred)) +
  geom_bar(width=.5, position = "dodge")

test_set %>% 
  ggplot(aes(TIDAL, fill=SPECIES)) +
  geom_bar(width=.5, position = "dodge")

test_set %>% 
  ggplot(aes(TIDAL, fill=rf_pred)) +
  geom_bar(width=.5, position = "dodge")

Finally, we examine variable importance using the mean decrease in Gini impurity:

variable_importance <- importance(rf_fit)

# data_frame() is deprecated in dplyr; use tibble() instead
tmp <- tibble(feature = rownames(variable_importance),
              Gini = variable_importance[, "MeanDecreaseGini"]) %>%
  arrange(desc(Gini))

tmp %>% ggplot(aes(x=reorder(feature, Gini), y=Gini)) + 
  geom_bar(stat='identity') +
  coord_flip() + xlab("Feature") +
  theme(axis.text=element_text(size=8))

We see that longitude and latitude are highly predictive of seagrass species, followed by water depth. Sediment and seabed type are much less important predictors. Thus, where the seagrass was found seems to matter more than the properties of the ocean floor.
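The same ranking can also be obtained with randomForest's built-in importance plot, without reshaping the importance matrix by hand:

```r
# dotchart of mean decrease in Gini impurity for each predictor
varImpPlot(rf_fit, main = "Variable importance")
```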

Multinomial logistic regression

We will now try a multinomial logistic regression model, using the multinom() function from the nnet package, to see how it compares to the random forest fit above. Note that this model omits the TIDAL predictor. The model is fit as follows:

# fit model
multinom_fit <- multinom(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT, data=train_set)
## # weights:  44 (30 variable)
## initial  value 12874.515732 
## iter  10 value 3247.989443
## iter  20 value 2669.068404
## iter  30 value 2546.836953
## iter  40 value 2474.509681
## iter  50 value 2458.621441
## iter  60 value 2431.044865
## iter  70 value 2430.887870
## iter  80 value 2430.768859
## iter  90 value 2428.784786
## iter 100 value 2428.631883
## final  value 2428.631883 
## stopped after 100 iterations
summary(multinom_fit)
## Call:
## multinom(formula = SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT, 
##     data = train_set)
## 
## Coefficients:
##            (Intercept)  LATITUDE LONGITUDE       DEPTH SEDIMENTMud SEDIMENTReef
## S_ISOETIFO   -350.6681 2.2716446  3.014258 -0.03157577  -53.283832    -53.62843
## T_HEMPRICH   -133.9094 1.8104662  1.290479 -0.30495007  -25.017614    -22.20410
## Z_CAPIRCOR   -191.1202 0.6697678  1.472643 -0.79180158   -9.913188    -60.52610
##            SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO    -52.32849      -87.37784    -52.54337     -51.50610
## T_HEMPRICH    -21.68547      -20.77540    -22.06726     -22.15954
## Z_CAPIRCOR    -12.86058      -85.57933    -10.79864     -11.44234
## 
## Std. Errors:
##             (Intercept)   LATITUDE   LONGITUDE      DEPTH SEDIMENTMud
## S_ISOETIFO 0.0023636052 0.05375936 0.007463262 0.02633879   0.4056890
## T_HEMPRICH 0.0007594124 0.05381106 0.006821702 0.03759962   0.2780858
## Z_CAPIRCOR 0.0013655214 0.02863418 0.004343490 0.02906978   0.2949554
##            SEDIMENTReef SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO 9.114478e-01    0.7479946   3.716063e-14    0.3850719     0.9699091
## T_HEMPRICH 4.574640e-01    0.6054729   4.008084e-01    0.2488036     0.8517372
## Z_CAPIRCOR 1.547949e-21    0.5936871   1.011638e-32    0.2968752     0.7001492
## 
## Residual Deviance: 4857.264 
## AIC: 4911.264
# predicted probabilities
predicted_prob <- predict(multinom_fit, newdata=test_set, type="probs")

# predicted classes
predicted_class <- predict(multinom_fit, newdata=test_set, type="class")

# classification performance metrics
confusionMatrix(data = predicted_class, reference = test_set$SPECIES )
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
##   C_SERRULAT        140         14          7         26
##   S_ISOETIFO          1          0          0          0
##   T_HEMPRICH         11          2        187         17
##   Z_CAPIRCOR        162          9         11       2507
## 
## Overall Statistics
##                                           
##                Accuracy : 0.916           
##                  95% CI : (0.9056, 0.9255)
##     No Information Rate : 0.8242          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6921          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity                    0.44586         0.0000000           0.91220
## Specificity                    0.98309         0.9996742           0.98962
## Pos Pred Value                 0.74866         0.0000000           0.86175
## Neg Pred Value                 0.94014         0.9919172           0.99374
## Prevalence                     0.10149         0.0080802           0.06626
## Detection Rate                 0.04525         0.0000000           0.06044
## Detection Prevalence           0.06044         0.0003232           0.07014
## Balanced Accuracy              0.71448         0.4998371           0.95091
##                      Class: Z_CAPIRCOR
## Sensitivity                     0.9831
## Specificity                     0.6654
## Pos Pred Value                  0.9323
## Neg Pred Value                  0.8938
## Prevalence                      0.8242
## Detection Rate                  0.8103
## Detection Prevalence            0.8691
## Balanced Accuracy               0.8243

We see that our multinomial logistic model has about 91.6% overall accuracy, which is worse than the random forest. The model predicts T_HEMPRICH best, with 91.2% sensitivity and 99.0% specificity. It performs well for Z_CAPIRCOR as well, but relatively poorly for C_SERRULAT and even worse for S_ISOETIFO, with 0% sensitivity. Again, we can assess our results visually:

# true values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
# predicted values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~predicted_class, type="scatter3d", mode="markers")

In this case, the truth and prediction plots look different; in particular, the prediction plot does not appear to show any instances of S_ISOETIFO. This makes sense given how few S_ISOETIFO observations there are in the data.

Below we see that across sediment types, the species predictions (predicted_class) appear fairly similar to the true species distribution (SPECIES). This is less true across seabed types, in the two bar graphs further below: there we see that S_ISOETIFO is essentially never predicted, and C_SERRULAT is predicted less often than it actually occurs.

test_set %>% 
  ggplot(aes(SEDIMENT, fill=SPECIES)) +
  geom_bar(width=.5, position = "dodge")

test_set %>% 
  ggplot(aes(SEDIMENT, fill=predicted_class)) +
  geom_bar(width=.5, position = "dodge")

test_set %>% 
  ggplot(aes(TIDAL, fill=SPECIES)) + 
  geom_bar(width=.5, position = "dodge")

test_set %>% 
  ggplot(aes(TIDAL, fill=predicted_class)) + 
  geom_bar(width=.5, position = "dodge")

The overall accuracy of the multinomial logistic regression model was 91.6%, while that of the random forest model was 95.5%. While these may seem fairly close, the overall accuracy of the multinomial logistic model is misleading, since it performed particularly poorly in terms of sensitivity for two of the four classes (C_SERRULAT and S_ISOETIFO).
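The per-class comparison can be made explicit using the byClass component of caret's confusionMatrix() output; a sketch, assuming the rf_pred and predicted_class vectors computed above:

```r
# recompute both confusion matrices and extract per-class sensitivity
rf_cm <- confusionMatrix(data = rf_pred, reference = test_set$SPECIES)
mn_cm <- confusionMatrix(data = predicted_class, reference = test_set$SPECIES)

# side-by-side sensitivities for the four species
data.frame(
  class = levels(test_set$SPECIES),
  rf_sensitivity = round(unname(rf_cm$byClass[, "Sensitivity"]), 3),
  multinom_sensitivity = round(unname(mn_cm$byClass[, "Sensitivity"]), 3)
)
```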